20 research outputs found

    GVSoC: A Highly Configurable, Fast and Accurate Full-Platform Simulator for RISC-V based IoT Processors

    Get PDF
    Open access (embargoed until 2022-04-27)
    Bruschi, Nazareno; Haugou, Germain; Tagliavini, Giuseppe; Conti, Francesco; Benini, Luca; Rossi, Davide

    An IoT Endpoint System-on-Chip for Secure and Energy-Efficient Near-Sensor Analytics

    Full text link
    Near-sensor data analytics is a promising direction for IoT endpoints, as it minimizes energy spent on communication and reduces network load - but it also poses security concerns, as valuable data is stored or sent over the network at various stages of the analytics pipeline. Using encryption to protect sensitive data at the boundary of the on-chip analytics engine is a way to address data security issues. To cope with the combined workload of analytics and encryption in a tight power envelope, we propose Fulmine, a System-on-Chip based on a tightly-coupled multi-core cluster augmented with specialized blocks for compute-intensive data processing and encryption functions, supporting software programmability for regular computing tasks. The Fulmine SoC, fabricated in 65nm technology, consumes less than 20mW on average at 0.8V, achieving an efficiency of up to 70pJ/B in encryption, 50pJ/px in convolution, or up to 25MIPS/mW in software. As a strong argument for real-life flexible application of our platform, we show experimental results for three secure analytics use cases: secure autonomous aerial surveillance with a state-of-the-art deep CNN consuming 3.16pJ per equivalent RISC op; local CNN-based face detection with secured remote recognition in 5.74pJ/op; and seizure detection with encrypted data collection from EEG within 12.7pJ/op.
    Comment: 15 pages, 12 figures, accepted for publication in the IEEE Transactions on Circuits and Systems I: Regular Papers
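    The 25MIPS/mW software figure can be cross-checked against the accelerator numbers with a unit conversion; a quick sketch (only the 25MIPS/mW value comes from the abstract):

```python
# Convert the reported software efficiency (25 MIPS/mW) into energy per
# instruction, so it can be compared with the accelerator figures in pJ/op:
# 1 mW sustained while executing 25 million instructions per second.
mips_per_mw = 25
energy_per_op_pj = 1e-3 / (mips_per_mw * 1e6) * 1e12  # W / (op/s) -> J -> pJ
print(f"{energy_per_op_pj:.0f} pJ/op")  # 40 pJ/op in software
```

    At roughly 40pJ per software op, the 3.16pJ/op reported for the CNN use case corresponds to an order-of-magnitude gain from the specialized blocks.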

    Supporting Localized OpenVX Kernel Execution for Efficient Computer Vision Application Development on STHORM Many-Core Platform

    No full text
    Nowadays Embedded Computer Vision (ECV) is considered a technology enabler for next-generation killer apps, and scientific and industrial communities are showing a growing interest in developing applications on high-end embedded systems. Modern many-core accelerators are a promising target for running common ECV algorithms, since their architectural features are particularly suitable in terms of data access patterns and program control flow. In this work we propose a set of software optimization techniques, mainly based on data tiling and local buffering policies, which are specifically targeted to accelerate the execution of OpenVX-based ECV applications by exploiting the memory hierarchy of the STHORM many-core accelerator.
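    The data-tiling and local-buffering idea can be sketched in a few lines: process an image band by band through a small buffer that stands in for on-chip memory. A minimal illustration (tile size, halo, and the kernel are hypothetical, not the values used on STHORM):

```python
def process_tiled(image, kernel_fn, tile_rows=8, halo=1):
    """Process `image` (a list of rows) band by band through a local buffer.

    Each band, plus a halo of rows for neighbourhood operators, is copied
    into a "local" buffer (standing in for a DMA transfer into on-chip
    memory), processed there, and the result written back.
    """
    h = len(image)
    out = [None] * h
    for r0 in range(0, h, tile_rows):
        r1 = min(r0 + tile_rows, h)
        lo, hi = max(r0 - halo, 0), min(r1 + halo, h)
        local = [row[:] for row in image[lo:hi]]  # "DMA" copy into local buffer
        result = kernel_fn(local)                 # compute on the tile only
        out[r0:r1] = result[r0 - lo : r0 - lo + (r1 - r0)]
    return out

# Point-operator example: double every pixel, tile by tile.
img = [[float(r * 16 + c) for c in range(16)] for r in range(40)]
doubled = process_tiled(img, lambda t: [[2 * p for p in row] for row in t])
```

    Only one band (plus its halo) is resident at a time, which is what lets the working set fit in the accelerator's L1 memory.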

    Optimizing memory bandwidth in OpenVX graph execution on embedded many-core accelerators

    No full text
    Computer vision and computational photography are hot application areas for mobile and embedded computing platforms. As a consequence, many-core accelerators are being developed to efficiently execute highly-parallel image processing kernels. However, power and cost constraints impose hard limits on the main memory bandwidth available, and push for software optimizations which minimize the usage of large frame buffers to store the intermediate results of multi-kernel applications. In this work we propose a set of techniques, mainly based on graph analysis and image tiling, targeted to accelerate the execution on cluster-based many-core accelerators of image processing applications expressed as standard OpenVX graphs. We have developed a run-time framework which implements these techniques using a front-end compliant to the OpenVX standard, and based on an OpenCL extension that enables more explicit control and efficient reuse of on-chip memory and greatly reduces the recourse to off-chip memory for storing intermediate results. Experiments performed on the STHORM many-core accelerator prototype demonstrate that our approach leads to massive reductions of main memory related stall time even when the main memory bandwidth available to the accelerator is severely constrained.
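    The graph-analysis side can be sketched as classifying each kernel in the OpenVX graph by its data-access pattern: point and window kernels can stream their intermediates tile by tile through on-chip memory, while data-dependent "global" kernels force a full frame buffer. A toy illustration (kernel names, access patterns, and frame size are hypothetical, and this is not the real OpenVX API):

```python
from graphlib import TopologicalSorter

# Toy multi-kernel pipeline: kernel -> set of producers it reads from.
deps = {"sobel": {"input"}, "magnitude": {"sobel"},
        "threshold": {"magnitude"}, "input": set()}
# Point/window kernels touch a bounded neighbourhood per output pixel and
# can be tiled; a "global" kernel (e.g. a histogram) needs the whole frame.
pattern = {"sobel": "window", "magnitude": "point", "threshold": "point"}

frame_bytes = 640 * 480  # one 8-bit VGA frame
order = [k for k in TopologicalSorter(deps).static_order() if k != "input"]

# Off-chip traffic for intermediates: each kernel writes one frame that the
# next reads back, unless the edge can be kept on-chip via tiling.
naive = 2 * frame_bytes * (len(order) - 1)
tiled = 2 * frame_bytes * sum(1 for k in order[1:] if pattern[k] == "global")
```

    In this toy pipeline every intermediate is tileable, so the off-chip traffic for intermediates drops from two full frames per edge to zero; a global kernel anywhere in the chain would reintroduce a frame buffer at that edge.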

    Optimizing memory bandwidth exploitation for OpenVX applications on embedded many-core accelerators

    No full text
    In recent years, image processing has been a key application area for mobile and embedded computing platforms. In this context, many-core accelerators are a viable solution to efficiently execute highly parallel kernels. However, architectural constraints impose hard limits on the main memory bandwidth, and push for software techniques which optimize the memory usage of complex multi-kernel applications. In this work, we propose a set of techniques, mainly based on graph analysis and image tiling, targeted to accelerate the execution of image processing applications expressed as standard OpenVX graphs on cluster-based many-core accelerators. We have developed a run-time framework which implements these techniques using a front-end compliant to the OpenVX standard, and based on an OpenCL extension that enables more explicit control and efficient reuse of on-chip memory and greatly reduces the recourse to off-chip memory for storing intermediate results. Experiments performed on the STHORM many-core accelerator demonstrate that our approach leads to massive reductions in execution time and bandwidth usage, even when the main memory bandwidth available to the accelerator is severely constrained.

    μDMA: An autonomous I/O subsystem for IoT end-nodes

    No full text
    The Internet of Things revolution requires long-battery-lifetime, autonomous end-nodes capable of probing the environment through multiple sensors and transmitting the data wirelessly after fusion, recognition, and classification. Duty-cycling is a well-known approach to extend battery lifetime: the hardware resources of the micro-controller unit (MCU) implementing the end-node are kept in sleep mode most of the time and activated on demand only during data acquisition or processing phases. To this end, most advanced MCUs feature autonomous I/O subsystems able to acquire data from multiple sensors while the CPU is idle. However, in these traditional I/O subsystems the interconnect is shared with the processing resources of the system, both converging on a single-port system memory. Moreover, the I/O and data-processing subsystems share the same power domain. In this work we overcome the bandwidth and power-management limitations of current MCU I/O architectures by introducing an autonomous I/O subsystem that tightly couples an I/O DMA with a multi-banked system memory, controlled by a tiny CPU residing in a dedicated power domain. The proposed architecture achieves a transfer efficiency of 84% when considering only data transfers, and 53% when also accounting for the overhead of the runtime running on the controlling processor, reducing the operating frequency of the I/O subsystem by up to 2.2x with respect to traditional MCU architectures.
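    The frequency-reduction claim follows directly from the efficiency figures: for a fixed sensor bandwidth, the less efficient the I/O subsystem, the faster it must be clocked. A back-of-the-envelope sketch (only the 84% and 53% figures come from the abstract; the bandwidth demand and datapath width are hypothetical):

```python
def min_io_freq_hz(bandwidth_bps, bytes_per_cycle, efficiency):
    """Lowest I/O clock that still sustains the given byte bandwidth."""
    return bandwidth_bps / (bytes_per_cycle * efficiency)

demand = 4_000_000  # hypothetical aggregate sensor bandwidth: 4 MB/s
width = 4           # hypothetical datapath: 4 bytes moved per active cycle

f_dma_only = min_io_freq_hz(demand, width, 0.84)  # transfers only
f_with_rt = min_io_freq_hz(demand, width, 0.53)   # incl. runtime overhead
```

    Since dynamic power scales with frequency, every point of transfer efficiency recovered translates into a lower clock, and hence lower power, for the always-on I/O domain.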

    ADRENALINE: An OpenVX Environment to Optimize Embedded Vision Applications on Many-core Accelerators

    Get PDF
    The acceleration of Computer Vision algorithms is an important enabler for the increasingly pervasive applications of the embedded vision domain. Heterogeneous systems featuring a clustered many-core accelerator are a very promising target for embedded vision workloads, but code optimization for these platforms is a challenging task. In this work we introduce ADRENALINE, a novel framework for fast prototyping and optimization of OpenVX applications for heterogeneous SoCs with many-core accelerators. ADRENALINE consists of an optimized OpenVX run-time system and a virtual platform, and it is intended to provide support to a wide range of end users. We highlight the benefits of this approach in different optimization contexts.

    Exploring Multi-Banked Shared-L1 Program Cache on Ultra-Low Power, Tightly Coupled Processor Clusters

    No full text
    L1 instruction caches in many-core systems represent a sizable fraction of the total power consumption. Although large instruction caches can significantly improve performance, they have the potential to increase power consumption. Private caches are usually able to achieve higher speed, due to their simpler design, but the smaller L1 memory space seen by each core induces a high miss ratio. A shared instruction cache is an attractive solution to improve performance and energy efficiency while reducing area. In this paper we propose a multi-banked, shared instruction cache architecture suitable for ultra-low power multicore systems, where parallelism and near-threshold operation are used to achieve minimum energy. We implemented the cluster architecture with different configurations of cache sharing, using the 28nm UTBB FD-SOI process from STMicroelectronics as reference technology. Experimental results, based on several real-life applications, demonstrate that the sharing mechanisms have no impact on the system operating frequency, and reduce the energy consumption of the cache subsystem by up to 10% while keeping the same area footprint, or halve the overall shared cache area while keeping the same performance and energy efficiency with respect to a cluster of processing elements with private program caches.
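    The private-versus-shared trade-off the abstract describes can be captured in a toy energy model: pooling capacity lowers the miss ratio, at the cost of a slightly more expensive hit through the sharing interconnect. A sketch with entirely hypothetical numbers (the abstract's measured saving is up to 10%):

```python
def icache_energy_pj(accesses, miss_ratio, e_hit_pj, e_miss_pj):
    """Total I-cache energy: every access pays a hit, misses add a refill."""
    return accesses * (e_hit_pj + miss_ratio * e_miss_pj)

N = 1_000_000  # instruction fetches
# Private caches: small per-core capacity -> higher miss ratio, cheap hits.
private = icache_energy_pj(N, miss_ratio=0.05, e_hit_pj=2.0, e_miss_pj=50.0)
# Shared multi-banked cache, same total area: pooled capacity lowers the
# miss ratio, but each hit crosses the sharing interconnect.
shared = icache_energy_pj(N, miss_ratio=0.01, e_hit_pj=2.2, e_miss_pj=50.0)
saving = 1.0 - shared / private  # fraction of cache-subsystem energy saved
```

    Whether sharing wins thus depends on how much the pooled miss ratio drops relative to the interconnect's per-hit overhead, which is exactly the design space the paper's sharing configurations explore.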